Confident Learning
This page contains my reading notes on Confident Learning (CL).
Notation:
Symbols with * refer to the unknown, true labels.
Symbols with \tilde{} refer to the given, noisy labels.
Symbols with \hat{} refer to estimates (outputs of the trained model).
The procedure needs two inputs:
Out-of-sample predicted probabilities \hat{\mathbf{P}}: a matrix of n rows (number of training instances) and m columns (number of labels).
CL requires users to train a model on the training set using cross-validation, so that every instance's predicted probabilities come from a model that never saw it during training.
The model must be able to output predicted probabilities for all possible labels.
The given labels \tilde{\mathbf{y}}: a vector of length n (# of training instances).
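The out-of-sample requirement can be met with cross-validation. Below is a minimal sketch using scikit-learn's `cross_val_predict` (assumed available) on a hypothetical toy dataset; the model choice and data are for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Hypothetical toy dataset: 40 instances, 2 features, 2 well-separated classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Out-of-sample predicted probabilities: each instance is scored by a
# model trained on folds that exclude it.
pred_probs = cross_val_predict(LogisticRegression(), X, y, cv=5,
                               method="predict_proba")
print(pred_probs.shape)  # (40, 2): n rows, m columns
```

The resulting matrix is the \hat{\mathbf{P}} input described above.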
Five methods to identify instances with noisy labels
1. CL baseline 1: C_{confusion}
An instance is considered to have a noisy label if its given label differs from the label with the largest predicted probability (the argmax).
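This baseline can be sketched in a few lines of numpy; the toy probabilities and labels below are hypothetical.

```python
import numpy as np

# Hypothetical inputs: out-of-sample predicted probabilities for
# n = 5 instances over m = 3 labels, plus the given (noisy) labels.
pred_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.3, 0.5],
    [0.8, 0.1, 0.1],
    [0.1, 0.2, 0.7],
])
labels = np.array([0, 1, 1, 0, 2])

# Flag an instance as noisy when its given label disagrees with the
# argmax of its predicted probabilities.
is_noisy = pred_probs.argmax(axis=1) != labels
print(is_noisy)  # instance 2 is flagged: argmax is 2 but given label is 1
```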
2. CL method 2: C_{\tilde{y}, y^{*}}
In this method, a matrix called the confident joint C_{\tilde{y}, y^{*}} is calculated from \hat{\mathbf{P}} and \tilde{\mathbf{y}}.
C_{\tilde{y}, y^{*}} | y^{*} = 0 | y^{*} = 1 | y^{*} = 2 |
---|---|---|---|
\tilde{y} = 0 | 100 | 40 | 20 |
\tilde{y} = 1 | 56 | 60 | 0 |
\tilde{y} = 2 | 32 | 12 | 80 |
To calculate this matrix:
For each label j, compute a threshold t_{j}: the average predicted probability for label j, taken over the instances whose given label is j.
For each instance \mathbf{x}_{k} with given label i, increment the entry C_{\tilde{y}=i, y^{*}=j} by 1, where the candidate true label j is the label with the largest predicted probability among all labels whose predicted probabilities are at least their respective thresholds t_{j}.
This basically means the true label of an instance is taken to be the label whose predicted probability is above that label's average predicted probability.
If more than one such label exists, choose the one with the largest predicted probability.
It is possible that no such label exists, in which case the instance is not counted in the matrix.
Thus, each entry in C_{\tilde{y}, y^{*}} corresponds to a set of training instances.
All instances that fall on the off-diagonal entries of C_{\tilde{y}, y^{*}} are considered to have noisy labels.
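The construction above can be sketched as follows; the toy probabilities, labels, and resulting counts are hypothetical.

```python
import numpy as np

# Hypothetical inputs: n = 6 instances, m = 3 labels.
pred_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.2, 0.8],
    [0.8, 0.1, 0.1],
    [0.1, 0.2, 0.7],
    [0.4, 0.4, 0.2],
])
labels = np.array([0, 1, 1, 0, 2, 0])
n, m = pred_probs.shape

# Per-label threshold t_j: average predicted probability for label j
# over the instances whose given label is j.
thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(m)])

# Confident joint C[i, j]: count of instances with given label i whose
# confidently-predicted true label is j.
C = np.zeros((m, m), dtype=int)
for k in range(n):
    above = np.where(pred_probs[k] >= thresholds)[0]
    if above.size == 0:
        continue  # no label clears its threshold: skip this instance
    j = above[np.argmax(pred_probs[k, above])]
    C[labels[k], j] += 1

# Instances counted in off-diagonal entries are flagged as noisy.
print(C)
```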
3. CL method 3: Prune by Class (PBC)
In this method and all methods below, another matrix, called the estimate of the joint \hat{Q}_{\tilde{y}, y^{*}}, is calculated from C_{\tilde{y}, y^{*}}.
\hat{Q}_{\tilde{y}, y^{*}} | y^{*} = 0 | y^{*} = 1 | y^{*} = 2 |
---|---|---|---|
\tilde{y} = 0 | 0.25 | 0.1 | 0.05 |
\tilde{y} = 1 | 0.14 | 0.15 | 0 |
\tilde{y} = 2 | 0.08 | 0.03 | 0.2 |
\hat{Q}_{\tilde{y}, y^{*}} is basically the normalized C_{\tilde{y}, y^{*}}: each entry of C_{\tilde{y}, y^{*}} is divided by the total number of training instances.
For each class i, the a instances with the lowest predicted probabilities for label i (among instances with given label i) are considered to have noisy labels, where a is the product of n and the sum of the off-diagonal entries in row i of \hat{Q}_{\tilde{y}, y^{*}}.
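A minimal sketch of PBC, reusing the hypothetical toy data from the confident-joint sketch; the confident joint is hard-coded here for self-containment.

```python
import numpy as np

# Same hypothetical toy data: n = 6 instances, m = 3 labels.
pred_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.2, 0.8],
    [0.8, 0.1, 0.1],
    [0.1, 0.2, 0.7],
    [0.4, 0.4, 0.2],
])
labels = np.array([0, 1, 1, 0, 2, 0])
n, m = pred_probs.shape

# Hypothetical confident joint, normalized into the estimate of the joint.
C = np.array([[2, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
Q = C / n

noisy = np.zeros(n, dtype=bool)
for i in range(m):
    # Prune a = n * (sum of off-diagonal entries in row i of Q)
    # instances from class i.
    a = int(round(n * (Q[i].sum() - Q[i, i])))
    if a == 0:
        continue
    idx = np.where(labels == i)[0]
    # Flag the a instances with the lowest predicted probability for
    # their own given label i.
    worst = idx[np.argsort(pred_probs[idx, i])[:a]]
    noisy[worst] = True
print(noisy)
```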
4. CL method 4: Prune by Noise Rate (PBNR)
For each off-diagonal entry (i, j) of \hat{Q}_{\tilde{y}, y^{*}}, the n \times \hat{Q}_{\tilde{y}=i, y^{*}=j} instances (with given label i) with the largest margin are considered to have noisy labels, where the margin of an instance \mathbf{x}_{k} with respect to given label i and true label j is \hat{\mathbf{P}}_{k, j} - \hat{\mathbf{P}}_{k, i}.
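A minimal sketch of PBNR on the same hypothetical toy data; again the confident joint is hard-coded for self-containment.

```python
import numpy as np

# Same hypothetical toy data: n = 6 instances, m = 3 labels.
pred_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.2, 0.8],
    [0.8, 0.1, 0.1],
    [0.1, 0.2, 0.7],
    [0.4, 0.4, 0.2],
])
labels = np.array([0, 1, 1, 0, 2, 0])
n, m = pred_probs.shape

# Hypothetical confident joint, normalized into the estimate of the joint.
C = np.array([[2, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
Q = C / n

noisy = np.zeros(n, dtype=bool)
for i in range(m):
    for j in range(m):
        if i == j:
            continue
        a = int(round(n * Q[i, j]))  # number of instances to prune for (i, j)
        if a == 0:
            continue
        idx = np.where(labels == i)[0]
        # Margin: how much more probable label j is than the given
        # label i under the model.
        margin = pred_probs[idx, j] - pred_probs[idx, i]
        worst = idx[np.argsort(-margin)[:a]]
        noisy[worst] = True
print(noisy)
```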
5. CL method 5: C + NR
An instance is considered to have a noisy label if both PBC and PBNR consider it to have a noisy label.
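Given the boolean masks produced by PBC and PBNR, C+NR is just their intersection; the masks below are hypothetical.

```python
import numpy as np

# Hypothetical noisy-label masks produced by PBC and PBNR.
pbc_noisy = np.array([False, False, True, False, False, True])
pbnr_noisy = np.array([False, False, True, False, True, False])

# C+NR flags only the instances that both methods flag.
cnr_noisy = pbc_noisy & pbnr_noisy
print(cnr_noisy)  # only instance 2 is flagged by both
```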